Abstract: Three important properties of a classification machinery
are: (i) the system preserves the core information of the
input data; (ii) the training examples convey information about
unseen data; and (iii) the system is able to treat differently points
from different classes. In this work we show that these fundamental
properties are satisfied by the architecture of deep neural
networks. We formally prove that these networks with random 
Gaussian weights perform a distance-preserving embedding of 
the data, with a special treatment for in-class and out-of-class 
data. Similar points at the input of the network are likely to 
have a similar output. The theoretical analysis of deep networks 
here presented exploits tools used in the compressed sensing 
and dictionary learning literature, thereby making a formal 
connection between these important topics. The derived results 
allow drawing conclusions on the metric learning properties 
of the network and their relation to its structure, as well as 
providing bounds on the required size of the training set such that 
the training examples would represent faithfully the unseen data. 
The results are validated with state-of-the-art trained networks. 


I. Introduction 

Deep neural networks (DNN) have led to a revolution in 
the areas of machine learning, audio analysis, and computer 
vision, achieving state-of-the-art results in numerous applications
[?], [?], [?]. In this work we formally study the properties
of deep network architectures with random weights applied 
to data residing in a low dimensional manifold. Our results 
provide insights into the outstanding empirically observed 
performance of DNN, the role of training, and the size of 
the training data. 

Our motivation for studying networks with random weights
is twofold. First, a series of works [?], [?], [?] empirically
showed successful DNN learning techniques based on randomization.
Second, studying a system with random weights
rather than learned deterministic ones may lead to a better
understanding of the system even in the deterministic case. For
example, in the field of compressed sensing, where the goal is
to recover a signal from a small number of its measurements,
the study of random sampling operators led to breakthroughs
in the understanding of the number of measurements required
for achieving a stable reconstruction [?]. While the bounds
provided in this case are universally optimal, the introduction
of a learning phase provides a better reconstruction performance
as it adapts the system to the particular data at hand
[?], [?], [?]. In the field of information retrieval, random
projections have been used in the locality-sensitive hashing (LSH)
scheme, which is capable of alleviating the curse of dimensionality
for approximate nearest neighbor search in very high dimensions
[?]. While the original randomized scheme is seldom used in
practice due to the availability of data-specific metric learning
algorithms, it has provided many fruitful insights. Other fields,
such as phase retrieval, gained significantly from a study based
on random Gaussian weights [?].

Notice that the technique of proving results for deep learning
with assumptions on some random distribution and then
showing that the same holds in the more general case is not
unique to our work. On the contrary, some of the stronger
recent theoretical results on DNN follow this path. For example,
Arora et al. analyzed the learning of autoencoders with
random weights in the range [-1, 1], showing that it is possible
to learn them in polynomial time under some restrictions on
the depth of the network [?]. Another example is the series of
works [?], [?], [?] that study the optimization perspective
of DNN.

In a similar fashion, in this work we study the properties of 
deep networks under the assumption of random weights. Before
we turn to describe our contribution, we survey previous
studies that formally analyzed the role of deep networks. 

Hornik et al. [?] and Cybenko [?] proved that neural networks
serve as universal approximators for any Borel measurable
function. However, finding the network weights for a
given function was shown to be NP-hard.

Bruna and Mallat proposed the wavelet scattering
transform, a cascade of wavelet transform convolutions with
nonlinear modulus and averaging operators [?]. They showed
for this deep architecture that with more layers the resulting
features can be made invariant to increasingly complex groups
of transformations. The study of the wavelet scattering transform
demonstrates that deeper architectures are able to better
capture invariant properties of objects and scenes in images.

Anselmi et al. showed that image representations invariant
to transformations such as translations and scaling can
considerably reduce the sample complexity of learning and
that a deep architecture with filtering and pooling can learn
such invariant representations [20]. This result is particularly
important in cases where training labels are scarce and in totally
unsupervised learning regimes.

Montufar and Morton showed that the depth of DNN allows
representing restricted Boltzmann machines with a number
of parameters exponentially greater than the number of the
network parameters [?]. Montufar et al. suggest that each
layer divides the space by a hyperplane [22]. Therefore, a
deep network divides the space into an exponential number of
sets, which is unachievable with a single layer with the same
number of parameters.

Bruna et al. showed that the pooling stage in DNN results
in shift invariance [23]. In [24], the same authors interpret
this step as the removal of phase from a complex signal and
show how the signal may be recovered after a pooling stage
using phase retrieval methods. This work also calculates the
Lipschitz constants of the pooling and the rectified linear unit
(ReLU) stages, showing that they perform a stable embedding
of the data under the assumption that the filters applied in the
network are frames, e.g., for the ReLU stage there exist two
constants $0 < A \le B$ such that for any $x, y \in \mathbb{R}^n$,

$A \|x - y\|_2 \le \|\rho(Mx) - \rho(My)\|_2 \le B \|x - y\|_2$, (1)

where $M \in \mathbb{R}^{m \times n}$ denotes the linear operator at a given layer
in the network with n and m denoting the input and output
dimensions, respectively, and $\rho(x) = \max(0, x)$ is the ReLU
operator applied element-wise. However, the values of the Lipschitz
constants A and B in real networks and their behavior as
a function of the data dimension currently elude understanding.
To see why such a bound may be very loose, consider the
output of only the linear part of a fully connected layer with
random i.i.d. Gaussian weights, $m_{ij} \sim \mathcal{N}(0, 1/\sqrt{m})$, which
is a standard initialization in deep learning. In this case, A
and B scale like $(1 \pm \sqrt{n/m})$, respectively [25]. This
undesired behavior is not unique to a normally-distributed M,
being characteristic of any distribution with a bounded fourth
moment. Note that the addition of the non-linear operators, the
ReLU and the pooling, makes these Lipschitz constants even
worse.
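
As a rough numerical illustration of how far the upper frame bound of a purely linear Gaussian layer can drift from 1, the following sketch estimates the largest singular value of such a matrix by power iteration. The layer sizes, the function name `largest_singular_value`, and the reference value $1 + \sqrt{n/m}$ (a standard random-matrix asymptotic, used here as an assumption) are illustrative choices and are not taken from the analysis above.

```python
import math
import random


def largest_singular_value(M, iterations=100, seed=1):
    """Estimate the largest singular value of M by power iteration on M^T M."""
    rng = random.Random(seed)
    m, n = len(M), len(M[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    sigma = 0.0
    for _ in range(iterations):
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm_v for vi in v]
        Mv = [sum(row[j] * v[j] for j in range(n)) for row in M]
        sigma = math.sqrt(sum(t * t for t in Mv))  # ||Mv||_2 for a unit vector v
        v = [sum(M[i][j] * Mv[i] for i in range(m)) for j in range(n)]  # M^T (M v)
    return sigma


rng = random.Random(0)
m, n = 100, 50  # hypothetical layer sizes, chosen only for this illustration
# Entries with standard deviation 1/sqrt(m), so that sqrt(m) * M is standard normal.
M = [[rng.gauss(0.0, 1.0 / math.sqrt(m)) for _ in range(n)] for _ in range(m)]
sigma_max = largest_singular_value(M)
reference = 1.0 + math.sqrt(n / m)  # assumed asymptotic value, for comparison only
print(sigma_max, reference)
```

Power iteration returns a lower estimate of the largest singular value, so the printed value should sit somewhat below the reference and clearly above 1.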

A. Contributions 

As the former example teaches, the scaling of the data introduced
by M may drastically deform the distances throughout
each layer, even in the case where m is very close to n, which
makes it unclear whether it is possible to recover the input of
the network (or of each layer) from its output. In this work, the
main question we focus on is: What happens to the metric of
the input data throughout the network? We focus on the
above-mentioned setting, assuming that the network has random i.i.d.
Gaussian weights. We prove that DNN preserve the metric
structure of the data as it propagates along the layers, allowing
for the stable recovery of the original data from the features
calculated by the network.

This type of property is often encountered in the literature
[24], [26]. Notice, however, that the recovery of the input is
possible if the size of the network output is proportional to
the intrinsic dimension of the data at the input (which is not
the case at the very last layer of the network, where we have
class labels only), similarly to data reconstruction from a small
number of random projections [?], [28], [29]. However, unlike
random projections that preserve the Euclidean distances
up to a small distortion [?], each layer of DNN with random
weights distorts these distances proportionally to the angles
between its input points: the smaller the angle at the input, the
stronger the shrinkage of the distances. Therefore, the deeper
the network, the stronger the shrinkage we get. Note that this
does not contradict the fact that we can recover the input from
the output; even when properties such as lighting, pose and
location are removed from an image (up to a certain extent), the
resemblance to the original image is still maintained.

As random projection is a universal sampling strategy for
any low dimensional data [?], [?], [?], deep networks with
random weights are a universal system that separates any
data (belonging to a low dimensional model) according to
the angles between its points, where the general assumption
is that there are large angles between different classes [?],
[?], [?], [?]. As training of the projection matrix adapts it
to better preserve specific distances over others, the training
of a network prioritizes intra-class angles over inter-class
ones. This relation is alluded to by our proof techniques and
is empirically manifested by observing the angles and the
Euclidean distances at the output of trained networks, as
demonstrated later in the paper in Section VI.

The rest of the paper is organized as follows: In Section II
we start by utilizing the recent theory of 1-bit compressed
sensing to show that each DNN layer preserves the metric of
its input data in the Gromov-Hausdorff sense up to a small
constant $\delta$, under the assumption that these data reside in
a low-dimensional manifold denoted by K. This allows us
to draw conclusions on the tessellation of the space created
by each layer of the network and the relation between the
operation of these layers and locality-sensitive hashing (LSH)
[?]. We also show that it is possible to retrieve the input of a
layer, up to a certain accuracy, from its output. This implies that
every layer preserves the important information of the data.

In Section III we proceed by analyzing the behavior of
the Euclidean distances and angles in the data throughout the 
network. This section reveals an important effect of the ReLU. 
Without the ReLU, we would just have random projections 
and Euclidean distance preservation. Our theory shows that 
the addition of ReLU makes the system sensitive to the angles 
between points. We prove that networks tend to decrease the 
Euclidean distances between points with a small angle between 
them (“same class”), more than the distances between points 
with large angles between them (“different classes”). 

Then, in Section IV we prove that low-dimensional data at
the input remain such throughout the entire network, i.e., DNN
(almost) do not increase the intrinsic dimension of the data.
This property is used in Section V to deduce the size of data
needed for training DNN.

We conclude by studying the role of training in Section VI.
As random networks are blind to the data labels, training
may select discriminatively the angles that cause the distance
deformation. Therefore, it will cause distances between different
classes to increase more than the distances within the
same class. We demonstrate this in several simulations, some
of which with networks that recently showed state-of-the-art
performance on challenging datasets, e.g., the network by [?]
for the ImageNet dataset [?]. Section VII concludes the paper
by summarizing the main results and outlining future research
directions.

It is worthwhile emphasizing that the assumption that
classes are separated by large angles is common in the literature
(see [?], [?], [?], [35]). This assumption can further
refer to some feature space rather than to the raw data space.
Of course, some examples might be found that contradict this
assumption such as the one of two concentric spheres, where
each sphere represents a different class. With respect to such
particular examples two things should be said: First, these
cases are rare in real-life signals, which typically exhibit some
amount of scale invariance that is absent in this example.
Second, we prove that the property of discrimination based on
angles holds for DNN with random weights and conjecture in
Section VI that a potential role (or consequence) of training
in DNN is to favor certain angles over others and to select
the origin of the coordinate system with respect to which the
angles are calculated. We illustrate in Section VI the effect of
training compared to random networks, using trained DNN
that have achieved state-of-the-art results in the literature.
Our claim is that DNN are suitable for models with clearly
distinguishable angles between the classes if random weights
are used, and for classes with some distinguishable angles
between them if training is used.

For the sake of simplicity of the discussion and presentation
clarity, we focus only on the role of the ReLU operator [38],
assuming that our data are properly aligned, i.e., they are
not invariant to operations such as rotation or translation, and
therefore there is no need for the pooling operation to achieve
invariance. Combining recent results for the phase retrieval
problem with the proof techniques in this paper can lead to
a theory also applicable to the pooling operation. In addition,
with the strategy in [?], [?], [?], [?], it is possible to
generalize our guarantees to sub-Gaussian distributions and to
random convolutional filters. We defer these natural extensions
to future studies.

II. Stable Embedding of a Single Layer

In this section we consider the distance between the metrics
of the input and output of a single DNN layer of the form
$f(M\cdot)$, mapping an input vector x to the output vector
$f(Mx)$, where M is an $m \times n$ random Gaussian matrix
and $f : \mathbb{R} \to \mathbb{R}$ is a semi-truncated linear function applied
element-wise. We call f semi-truncated linear if it is linear on
some (possibly, semi-infinite) interval and constant outside of it,
$f(0) = 0$, $0 < f(x) \le x$, $\forall x > 0$, and $0 \ge f(x) \ge x$, $\forall x < 0$.
The ReLU, henceforth denoted by $\rho$, is an example of such
a function, while the sigmoid and the hyperbolic tangent
functions satisfy this property approximately.
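
The pointwise part of this definition is easy to test on a grid. The sketch below is only an illustration: the helper `satisfies_pointwise_bounds` and the clipped-linear example `hard_tanh` are hypothetical names introduced here, and the check covers the inequalities and $f(0) = 0$ but not the requirement that f be linear on an interval and constant outside of it.

```python
import math


def relu(x):
    return max(0.0, x)


def hard_tanh(x):
    # Linear on [-1, 1] and constant outside of it.
    return max(-1.0, min(1.0, x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def satisfies_pointwise_bounds(f, grid):
    """Check f(0) = 0, 0 < f(x) <= x for x > 0, and x <= f(x) <= 0 for x < 0."""
    if f(0.0) != 0.0:
        return False
    for x in grid:
        if x > 0 and not (0.0 < f(x) <= x):
            return False
        if x < 0 and not (x <= f(x) <= 0.0):
            return False
    return True


grid = [i / 10.0 for i in range(-50, 51)]
results = {f.__name__: satisfies_pointwise_bounds(f, grid)
           for f in (relu, hard_tanh, sigmoid)}
print(results)
```

The sigmoid fails the exact test because it does not vanish at the origin, which is consistent with it satisfying the property only approximately.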

We assume that the input data belong to a manifold K with
the Gaussian mean width

$\omega(K) := E[\sup_{x, y \in K} \langle g, x - y \rangle]$, (2)

where the expectation is taken over a random vector g with
normal i.i.d. elements. To better understand this definition,
note that $\sup_{x, y \in K} \langle g, x - y \rangle$ is the width of the set K in the
direction of g as illustrated in Fig. 1. The mean provides an
average over the widths of the set K in different isotropically
distributed directions, leading to the definition of the Gaussian
mean width $\omega(K)$.

Fig. 1. Width of the set K in the direction of g. This figure is a variant of
Fig. 1 in [?]. Used with permission of the authors.
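
For a finite set of points, definition (2) can be estimated directly by Monte Carlo sampling of g. The following sketch is a minimal illustration under that assumption; the function name `gaussian_mean_width`, the sample count, and the two-point test set are choices made here for demonstration.

```python
import math
import random


def gaussian_mean_width(points, num_samples=20000, seed=0):
    """Monte Carlo estimate of E[sup_{x,y in K} <g, x - y>] for a finite set K."""
    rng = random.Random(seed)
    dim = len(points[0])
    total = 0.0
    for _ in range(num_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        projections = [sum(gi * xi for gi, xi in zip(g, x)) for x in points]
        # The sup over pairs of <g, x - y> is the largest projection minus the smallest.
        total += max(projections) - min(projections)
    return total / num_samples


K = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
estimate = gaussian_mean_width(K)
exact = 2.0 * math.sqrt(2.0 / math.pi)  # E[2 |g_1|] for this two-point set
print(estimate, exact)
```

For the two antipodal points the width in direction g is $2|g_1|$, so the estimate can be compared against the closed-form expectation.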

The Gaussian mean width is a useful measure for the
dimensionality of a set. As an example we consider the
following two popular data models:

a) Gaussian mixture models: $K \subset \mathbb{B}_1^n$ (the $\ell_2$-ball)
consists of L Gaussians of dimension k. In this case
$\omega(K) = O(\sqrt{k + \log L})$.

b) Sparsely representable signals: The data in $K \subset \mathbb{B}_1^n$
can be approximated by a sparse linear combination of atoms
of a dictionary, i.e., $K = \{x = D\alpha : \|\alpha\|_0 \le k, \|x\|_2 \le 1\}$,
where $\|\cdot\|_0$ is the pseudo-norm that counts the number of nonzeros
in a vector and $D \in \mathbb{R}^{n \times L}$ is the given dictionary. For
this model $\omega(K) = O(\sqrt{k \log(L/k)})$.

Similar results can be shown for other models such as
union of subspaces and low dimensional manifolds. For more
examples and details on $\omega(K)$, we refer the reader to [?],
[?].
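
To make the sparse model concrete, the sketch below estimates the mean width of the set of k-sparse vectors in the unit ball, assuming for simplicity that the dictionary D is the identity (so n = L). This simplification, the function name `sparse_model_width`, and the chosen sizes are assumptions made only for illustration.

```python
import math
import random


def sparse_model_width(k, L, num_samples=2000, seed=0):
    """Estimate the mean width of K = {x : ||x||_0 <= k, ||x||_2 <= 1} with D = I."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(L)]
        top_k = sorted((gi * gi for gi in g), reverse=True)[:k]
        # sup_{x in K} <g, x> is the l2-norm of the k largest-magnitude entries of g;
        # K is symmetric, so sup_{x,y in K} <g, x - y> is twice that value.
        total += 2.0 * math.sqrt(sum(top_k))
    return total / num_samples


L = 50
widths = {k: sparse_model_width(k, L) for k in (2, 10, 50)}
print(widths)
```

The estimated width grows with the sparsity level k and, for k = L, approaches the width of the full unit ball, which is close to $2\sqrt{L}$.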

We now show that each standard DNN layer performs a
stable embedding of the data in the Gromov-Hausdorff sense
[?], i.e., it is a $\delta$-isometry between $(K, d_K)$ and $(K', d_{K'})$,
where K and K' are the manifolds of the input and output
data and $d_K$ and $d_{K'}$ are metrics induced on them. A function
$h : K \to K'$ is a $\delta$-isometry if

$|d_{K'}(h(x), h(y)) - d_K(x, y)| \le \delta$, $\forall x, y \in K$, (3)

and for every $x' \in K'$ there exists $x \in K$ such that
$d_{K'}(x', h(x)) \le \delta$ (the latter property is sometimes called
$\delta$-surjectivity). In the following theorem and throughout the
paper C denotes a given constant (not necessarily the same
one) and $\mathbb{S}^{n-1}$ the unit sphere in $\mathbb{R}^n$.
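
Condition (3) can be evaluated on any finite sample of points. The sketch below is a minimal illustration; the helper name `isometry_defect` and the toy maps are hypothetical, and the $\delta$-surjectivity condition is not tested.

```python
import itertools


def isometry_defect(points, h, d_in, d_out):
    """Largest |d_out(h(x), h(y)) - d_in(x, y)| over all pairs of sample points."""
    return max(
        abs(d_out(h(x), h(y)) - d_in(x, y))
        for x, y in itertools.combinations(points, 2)
    )


def abs_dist(a, b):
    return abs(a - b)


points = [0.0, 0.25, 1.0]
defect_identity = isometry_defect(points, lambda x: x, abs_dist, abs_dist)
defect_halving = isometry_defect(points, lambda x: 0.5 * x, abs_dist, abs_dist)
print(defect_identity, defect_halving)
```

A map satisfies (3) on the sample for every $\delta$ at least as large as the returned defect: the identity gives zero, while halving all coordinates distorts the largest distance in this sample by 0.5.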

Theorem 1: Let M be the linear operator applied at a
DNN layer, f a semi-truncated linear activation function, and
$K \subset \mathbb{S}^{n-1}$ the manifold of the input data for that layer.
If $\sqrt{m}M \in \mathbb{R}^{m \times n}$ is a random matrix with i.i.d. normally
distributed entries, then the map
$g : (K \subset \mathbb{S}^{n-1}, d_{\mathbb{S}^{n-1}}) \to (g(K), d_{\mathbb{H}^m})$
defined by $g(x) = \mathrm{sgn}(f(Mx))$ is with high
probability a $\delta$-isometry, i.e.,

$|d_{\mathbb{S}^{n-1}}(x, y) - d_{\mathbb{H}^m}(g(x), g(y))| \le \delta$, $\forall x, y \in K$,

with $\delta \le C \ldots$

In the theorem $d_{\mathbb{S}^{n-1}}$ is the geodesic distance on $\mathbb{S}^{n-1}$,
$d_{\mathbb{H}^m}$ is the Hamming distance and the sign function, $\mathrm{sgn}(\cdot)$, is
applied elementwise, and is defined as $\mathrm{sgn}(x) = 1$ if $x > 0$
and $\mathrm{sgn}(x) = -1$ if $x \le 0$. The proof of the approximate
injectivity follows from the proof of Theorem 1.5 in [43].

Theorem 1 is important as it provides a better understanding
of the tessellation of the space that each layer creates. This
result stands in line with [22], which suggested that each layer
in the network creates a tessellation of the input data by the
different hyperplanes imposed by the rows in M. However,
Theorem 1 implies more than that. It implies that each cell
in the tessellation has a diameter of at most $\delta$ (see also
Corollary 1.9 in [?]), i.e., if x and y fall to the same side
of all the hyperplanes, then $\|x - y\|_2 \le \delta$. In addition, the
number of hyperplanes separating two points x and y in K
contains enough information to deduce their distance up to a
small distortion. From this perspective, each layer followed
by the sign function acts as locality-sensitive hashing [?],
approximately embedding the input metric into the Hamming
metric.
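
The hashing interpretation can be illustrated numerically: the fraction of hyperplanes separating two points should track their angle. The sketch below assumes that both distances are normalized to [0, 1] (Hamming distance divided by m, geodesic distance divided by $\pi$), which is an assumption made here and not stated above; the dimensions and helper names are likewise illustrative.

```python
import math
import random


def sign_code(M, x):
    """Binary code sgn(rho(Mx)): +1 where the row response is positive, else -1."""
    return [1 if sum(mi * xi for mi, xi in zip(row, x)) > 0 else -1 for row in M]


def normalized_hamming(a, b):
    return sum(1 for ai, bi in zip(a, b) if ai != bi) / len(a)


rng = random.Random(0)
n, m = 10, 4000  # illustrative input and output dimensions
# Rows of M are i.i.d. Gaussian; the 1/sqrt(m) scaling does not affect the signs.
M = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

theta = math.pi / 3
x = [1.0] + [0.0] * (n - 1)
y = [math.cos(theta), math.sin(theta)] + [0.0] * (n - 2)

hamming = normalized_hamming(sign_code(M, x), sign_code(M, y))
geodesic = theta / math.pi  # normalized geodesic distance between x and y
print(hamming, geodesic)
```

Since the ReLU output is positive exactly where the linear response is positive, taking the sign after the ReLU gives the same code as taking the sign of Mx directly.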

Having a stable embedding of the data, it is natural to
assume that it is possible to recover the input of a layer from
its output. Indeed, Mahendran and Vedaldi demonstrate that it
is achievable through the whole network [26]. The next result
provides a theoretical justification for this, showing that it is
possible to recover the input of each layer from its output:

Theorem 2: Under the assumptions of Theorem 1 there
exists a program $\mathcal{A}$ such that $\|x - \mathcal{A}(f(Mx))\|_2 \le \epsilon$, where

The proof follows from Theorem 1.3 in [?]. If K is a cone
then one may use also Theorem 2.1 in [45] to get a similar
result.

Theorems 1 and 2 both apply existing results from
1-bit compressed sensing to DNN. Theorem 1 deals with
embedding into the Hamming cube and Theorem 2 uses this
fact to show that we can recover the input from the output.
Indeed, Theorem 1 only applies to an individual layer and
cannot be applied consecutively throughout the network, since
it deals with embedding into the Hamming cube. One way
to deal with this problem is to extend the proof in [43]
to an embedding from $\mathbb{S}^{n-1}$ to $\mathbb{S}^{m-1}$. Instead, we turn to
focus on the ReLU and prove more specific results about the
exact distortion of angles and Euclidean distances. These also
include a proof about a stable embedding of the network at
each layer from $\mathbb{R}^n$ to $\mathbb{R}^m$.

III. Distance and Angle Distortion 

So far we have focused on the metric preservation of the 
deep networks in terms of the Gromov-Hausdorff distance. In 
this section we turn to look at how the Euclidean distances 
and angles change within the layers. We focus on the case of 
ReLU as the activation function. A similar analysis can also 
be applied for pooling. For the simplicity of the discussion we
defer it to future study.

Note that so far we assumed that K is normalized and lies
on the sphere $\mathbb{S}^{n-1}$. Given that the data at the input of the
network lie on the sphere and we use the ReLU $\rho$ as the
activation function, the transformation $\rho(M\cdot)$ keeps the output
data (approximately) on a sphere (with half the diameter, see
the proof of Theorem 3 in the sequel). Therefore, in
this case the normalization requirement holds up to a small
distortion throughout the layers. This adds a motivation for
having a normalization stage at the output of each layer, which
was shown to provide some gain in several DNN [?], [46].

Fig. 2. Behavior of $\psi(x, y)$ (Eq. (5)) as a function of $\cos\angle(x, y)$. The
two extremities $\cos\angle(x, y) = \pm 1$ correspond to the angles zero and $\pi$,
respectively, between x and y. Notice that the larger is the angle between
the two, the larger is the value of $\psi(x, y)$. In addition, it vanishes for small
angles between the two vectors.

Fig. 3. Behavior of the angle between two points x and y in the output of a
DNN layer as a function of their angle in the input. In the ranges $[0, \pi/4]$ and
$[\pi/4, \pi/2]$ the output behaves approximately like $0.95\angle(x, y)$ and
$0.21 + \frac{2}{3}\angle(x, y)$, respectively.

Normalization, which is also useful in shallow representations
[?], can be interpreted as a transformation making
the inner products between the data vectors coincide with the 
cosines of the corresponding angles. While the bounds we 
provide in this section do not require normalization, they show 
that the operations of each layer rely on the angles between 
the data points. 

The following two results relate the Euclidean and angular 
distances in the input of a given layer to the ones in its output. 
We denote by $\mathbb{B}_r^n \subset \mathbb{R}^n$ the Euclidean ball of radius r.

Theorem 3: Let M be the linear operator applied at a given
layer, $\rho$ (the ReLU) be the activation function, and $K \subset \mathbb{B}_1^n$ be
the manifold of the data in the input layer. If $\sqrt{m}M \in \mathbb{R}^{m \times n}$
is a random matrix with i.i.d. normally distributed entries and
$m \ge C\delta^{-4}\omega(K)^4$, then with high probability

$\left| \|\rho(Mx) - \rho(My)\|_2^2 - \left( \frac{1}{2}\|x - y\|_2^2 + \|x\|_2 \|y\|_2 \psi(x, y) \right) \right| \le \delta$, (4)

where $0 \le \angle(x, y) \triangleq \cos^{-1}\left( \frac{x^T y}{\|x\|_2 \|y\|_2} \right) \le \pi$ and

$\psi(x, y) = \frac{1}{\pi}\left( \sin\angle(x, y) - \angle(x, y)\cos\angle(x, y) \right)$. (5)

Theorem 4: Under the same conditions of Theorem 3 and
$K \subset \mathbb{B}_1^n \setminus \mathbb{B}_\beta^n$, where $\delta \ll \beta^2 < 1$, with high probability

$\left| \cos\angle(\rho(Mx), \rho(My)) - (\cos\angle(x, y) + \psi(x, y)) \right| \le \frac{15\delta}{\beta^2 - 2\delta}$. (6)

Fig. 4. Left: Histogram of the angles in the output of the 8-th (dashed-lines) and 16-th (continuous-lines) layers of the ImageNet deep network for different
input angle ranges. Middle: Histogram of the ratio between the angles in the output of the 8-th (dashed-line) and 16-th (continuous-line) layers of this network
and the angles in its input for different input angle ranges. Right: Histogram of the differences between the angles in the output of the 8-th (dashed-line) and
16-th (continuous-line) layers of this network and the angles in its input for different input angle ranges.

Remark 1: As we have seen in Section II, the assumption
$m \ge C\delta^{-4}\omega(K)^4$ implies $m = O(k^2)$ if K is a GMM and
$m = O(k^2 \log L)$ if K is generated by k-sparse representations
in a dictionary $D \in \mathbb{R}^{n \times L}$. As we shall see later
in Theorem 6 in Section IV, it is enough to have the model
assumption only at the data in the input of the DNN. Finally,
the quadratic relationship between m and $\omega(K)^2$ might be
improved.

We leave the proof of these theorems to Appendices A and
B and dwell on their implications.

Note that if $\psi(x, y)$ were equal to zero, then Theorems 3 and
4 would have stated that the distances and angles are preserved
(in the former case, up to a factor of 0.5)¹ throughout the network.
As can be seen in Fig. 2, $\psi(x, y)$ behaves approximately
like $0.5(1 - \cos\angle(x, y))$. The larger is the angle $\angle(x, y)$,
the larger is $\psi(x, y)$, and, consequently, also the Euclidean
distance at the output. If the angle between x and y is close
to zero (i.e., $\cos\angle(x, y)$ close to 1), $\psi(x, y)$ vanishes and
therefore the Euclidean distance shrinks by half throughout
the layers of the network. We emphasize that this is not in
contradiction to Theorem 1, which guarantees approximately
isometric embedding into the Hamming space. While the
binarized output with the Hamming metric approximately
preserves the input metric, the Euclidean metric on the raw
output is distorted.

¹More specifically, we would have $\|\rho(Mx) - \rho(My)\|_2^2 = \frac{1}{2}\|x - y\|_2^2$.
Notice that this is the expectation of the distance $\|\rho(Mx - My)\|_2^2$.

Considering this effect on the Euclidean distance, the 
smaller the angle between x and y, the larger the distortion 
to this distance at the output of the layer, and the smaller 
the distance turns to be. On the other hand, the shrinkage of 
the distances is bounded as can be seen from the following
corollary of Theorem 3.

Corollary 5: Under the same conditions of Theorem 3, with
high probability

$\frac{1}{2}\|x - y\|_2^2 - \delta \le \|\rho(Mx) - \rho(My)\|_2^2 \le \|x - y\|_2^2 + \delta$. (7)

The proof follows from the inequality of arithmetic and
geometric means and the behavior of $\psi(x, y)$ (see Fig. 2). We
conclude that DNN with random Gaussian weights preserve 
local structures in the manifold K and enable decreasing 
distances between points away from it, a property much 
desired for classification. 

The influence of the entire network on the angles is slightly
different. Note that starting from the input of the second layer,
all the vectors reside in the non-negative orthant. The cosine
of the angles is translated from the range [-1, 1] in the first
layer to the range [0, 1] in the subsequent layers. In
particular, the range [-1, 0] is translated to the range $[0, 1/\pi]$,
and in terms of the angle $\angle(x, y)$ from the range $[\pi/2, \pi]$ to
$[\cos^{-1}(1/\pi), \pi/2]$. These angles shrink approximately by half,
while the ones that are initially small remain approximately
unchanged.
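
The angle behavior described above can be reproduced from the relation $\cos\angle(\rho(Mx), \rho(My)) \approx \cos\angle(x, y) + \psi(x, y)$ of Theorem 4. The sketch below compares this prediction with one simulated random ReLU layer; the dimensions, seed, and function names are illustrative assumptions, and the simulation ignores the $\delta$-dependent error term.

```python
import math
import random


def psi(theta):
    return (math.sin(theta) - theta * math.cos(theta)) / math.pi


def predicted_output_angle(theta):
    """Angle after one random ReLU layer according to cos(out) = cos(in) + psi(in)."""
    c = math.cos(theta) + psi(theta)
    return math.acos(max(-1.0, min(1.0, c)))


def simulated_output_angle(theta, n=8, m=4000, seed=0):
    """Angle between rho(Mx) and rho(My) for unit vectors x, y at angle theta."""
    rng = random.Random(seed)
    x = [1.0] + [0.0] * (n - 1)
    y = [math.cos(theta), math.sin(theta)] + [0.0] * (n - 2)
    dot = nx = ny = 0.0
    for _ in range(m):
        row = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ax = max(0.0, sum(r * v for r, v in zip(row, x)))
        ay = max(0.0, sum(r * v for r, v in zip(row, y)))
        dot += ax * ay
        nx += ax * ax
        ny += ay * ay
    return math.acos(min(1.0, dot / math.sqrt(nx * ny)))


for theta in (math.pi / 4, math.pi / 2, math.pi):
    print(round(theta, 3),
          round(predicted_output_angle(theta), 3),
          round(simulated_output_angle(theta), 3))
```

The common scaling of the rows of M cancels in the angle, so unnormalized Gaussian rows are used in the simulation.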

The action of the network preserves the order between the
angles. Generally speaking, the network affects the angles
in the range $[0, \pi/2]$ in the same way. In particular, in the
range $[0, \pi/4]$ the angles in the output of the layer behave
like $0.95\angle(x, y)$ and in the wider range $[0, \pi/2]$ they are
bounded from below by $0.8\angle(x, y)$ (see Fig. 3). Therefore,
we conclude that the DNN distort the angles in the input
manifold K similarly and keep the general configuration of
angles between the points.

Fig. 5. Sketch of the distortion of two classes with distinguishable angle
between them as obtained by one layer of DNN with random weights. These
networks are suitable for separating this type of classes. Note that the distance
between the blue and red points shrinks less than the distance between the
red points as the angle between the latter is smaller.

To see that our theory captures the behavior of DNN
endowed with pooling, we test how the angles change through
the state-of-the-art 19-layers deep network trained in [?] for
the ImageNet dataset. We randomly select 3-10^ angles (pairs 
of data points) in the input of the network, partitioned to three 
equally-sized groups, each group corresponding to one of the
ranges $[0, \pi/4]$, $[\pi/4, \pi/2]$ and $[\pi/2, \pi]$. We test their behavior
after applying eight and sixteen non-linear layers. The latter
case corresponds to the part of the network excluding the fully
connected layers. We denote by $\bar{x}$ the vector in the output of
a layer corresponding to the input vector x. Fig. 4 presents a
histogram of the values of the angles $\angle(\bar{x}, \bar{y})$ at the output of
each of the layers for each of the three groups. It shows also
the ratio $\angle(\bar{x}, \bar{y})/\angle(x, y)$ and difference $\angle(\bar{x}, \bar{y}) - \angle(x, y)$,
between the angles at the output of the layers and their original
value at the input of the network.

As Theorem 4 predicts, the ratio $\angle(\bar{x}, \bar{y})/\angle(x, y)$ corresponding
to $\angle(x, y) \in [\pi/2, \pi]$ is half the ratio corresponding
to input angles in the range $[0, \pi/2]$. Furthermore, the ratios in
the ranges $[0, \pi/4]$ and $[\pi/4, \pi/2]$ are approximately the same,
where in the range $[\pi/4, \pi/2]$ they are slightly larger. This is
in line with Theorem 4, which claims that the angles in this range
decrease at approximately the same rate, where for larger
angles the shrinkage is slightly larger. Also note that according
to our theory the ratio corresponding to input angles in the
range $[0, \pi/4]$ should behave on average like $0.95^q$, where q
is the number of layers. Indeed, for q = 8, $0.95^8 = 0.66$ and
for q = 16, $0.95^{16} = 0.44$; the centers of the histograms for
the range $[0, \pi/4]$ are very close to these values. Notice that
we have a similar behavior also for the range $[\pi/4, \pi/2]$. This
is not surprising, as by looking at Fig. 3 one may observe
that these angles also turn to be in the range that has the
ratio 0.95. Remarkably, Fig. 4 demonstrates that the network
keeps the order between the angles as Theorem 4 suggests.
Notice that the shrinkage of the angles does not cause large
angles to become smaller than other angles that were originally
significantly smaller than them. Moreover, small angles in the
input remain small in the output as can be seen in Fig. 4 (right).
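
The arithmetic behind the $0.95^q$ rule of thumb, and the way a per-layer angle map composes over depth, can be written out in a few lines. In the sketch below, iterating the one-layer map implied by Theorem 4 is an illustrative assumption about independent random layers, not a model of the trained ImageNet network; the starting angle $\pi/8$ and the function names are arbitrary choices.

```python
import math


def layer_angle_map(theta):
    """One-layer angle map implied by cos(out) = cos(in) + psi(in)."""
    psi = (math.sin(theta) - theta * math.cos(theta)) / math.pi
    return math.acos(max(-1.0, min(1.0, math.cos(theta) + psi)))


def angle_after_layers(theta, q):
    for _ in range(q):
        theta = layer_angle_map(theta)
    return theta


# The rule of thumb quoted in the text for input angles in [0, pi/4].
rule_of_thumb = {q: 0.95 ** q for q in (8, 16)}

# Ratio between output and input angle when the random-layer map is iterated q times.
start = math.pi / 8
iterated_ratio = {q: angle_after_layers(start, q) / start for q in (8, 16)}
print(rule_of_thumb, iterated_ratio)
```

Because $\psi$ is positive for nonzero angles, every application of the map strictly decreases the angle, so the iterated ratio falls with depth while the ordering of angles is kept.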

We sketch the distortion of two sets with distinguishable
angle between them by one layer of DNN with random weights
in Fig. 5. It can be observed that the distance between points
with a smaller angle between them shrinks more than the
distance between points with a larger angle between them.
Ideally, we would like this behavior, causing points belonging
to the same class to stay closer to each other in the output
of the network, compared to points from different classes.
However, random networks are not selective in this sense: if
a point x forms the same angle with a point z from its class
and with a point y from another class, then their distance will
be distorted approximately by an equal amount. Moreover, the
separation caused by the network is dependent on the setting of
the coordinate system origin with respect to which the angles
are calculated. The location of the origin is dependent on the
bias terms b (in this case each layer is of the form $\rho(Mx + b)$),
which are set to zero in the random networks here studied.
These are learned by the training of the network, affecting the
angles that cause the distortions of the Euclidean (and angular)
distances. We demonstrate the effect of training in Section VI.

IV. Embedding of the Entire Network 

In order to show that the results in Sections II and III
also apply to the entire network and not only to one layer,
we need to show that the Gaussian mean width does not
grow significantly as the data propagate through the layers.
Instead of bounding the variation of the Gaussian mean width
throughout the network, we bound the change in the covering
number $N_\epsilon(K)$, i.e., the smallest number of $\ell_2$-balls of radius
$\epsilon$ that cover K. Having the bound on the covering number, we
use Dudley's inequality [48],

$\omega(K) \le C \int_0^\infty \sqrt{\log N_\epsilon(K)}\, d\epsilon$, (8)

to bound the Gaussian mean width.
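
For a finite point set, both the covering number and the integral in (8) can be approximated numerically. The sketch below uses a greedy cover, which only gives an upper bound on $N_\epsilon(K)$ with centers restricted to the sample, and a left Riemann sum in place of the integral, without the constant C. The function names and the toy point set are assumptions made for illustration.

```python
import math


def greedy_cover(points, eps):
    """Greedily pick centers from `points` so every point is within eps of a center."""
    centers = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def dudley_sum(points, eps_grid):
    """Left Riemann sum of sqrt(log N_eps) over eps_grid, using greedy cover sizes."""
    total = 0.0
    for lo, hi in zip(eps_grid[:-1], eps_grid[1:]):
        total += math.sqrt(math.log(len(greedy_cover(points, lo)))) * (hi - lo)
    return total


points = [(i / 10.0,) for i in range(11)]  # 11 points on the segment [0, 1]
cover_sizes = {eps: len(greedy_cover(points, eps)) for eps in (0.05, 0.25, 1.0)}
integral_estimate = dudley_sum(points, [0.05 * j for j in range(1, 22)])
print(cover_sizes, integral_estimate)
```

Once $\epsilon$ reaches the diameter of the set a single ball suffices, the logarithm vanishes, and the remaining part of the integral contributes nothing.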

Theorem 6: Under the assumptions of Theorem 1,

(iT). (9) 

Proof: We divide the proof into two parts. In the first one,
we consider the effect of the activation function f on the size
of the covering, while in the second we examine the effect
of the linear transformation M. Starting with the activation
function, let $x_0 \in K$ be a center of a ball in the covering of
K and $x \in K$ be a point that belongs to the ball of $x_0$ of
radius $\epsilon$, i.e., $\|x - x_0\|_2 \le \epsilon$. It is not hard to see that since a
semi-truncated linear activation function shrinks the data, then
$\|f(x) - f(x_0)\|_2 \le \|x - x_0\|_2 \le \epsilon$ and therefore the size of
the covering does not increase (but might decrease).

Eor the linear part we have that li49l Theorem 1.4] 

IIMX-MX 0 II 2 < ^1-f ||x-Xg||2. (10) 

Therefore, after the linear operation each covering ball with
initial radius $\varepsilon$ is not bigger than $\left(1 + \frac{\omega(K)}{\sqrt{m}}\right)\varepsilon$. Since the
activation function does not increase the size of the covering,
we have that after a linear operation followed by an activation
function, the size of the covering balls increases by at most a
factor of $\left(1 + \frac{\omega(K)}{\sqrt{m}}\right)$. Therefore, the size of a covering with balls of
radius $\varepsilon$ of the output data $f(\mathbf{M}K)$ is bounded by the size of
a covering of $K$ with balls of radius $\varepsilon/\left(1 + \frac{\omega(K)}{\sqrt{m}}\right)$. □

Theorem 6 generalizes the results in Theorems 2, 3 and 4
such that they can be used for the whole network: there
exists an algorithm that recovers the input of the DNN from
its output; the DNN as a whole distorts the Euclidean distances
based on the angles of the input of the network; and the angular
distances smaller than $\pi$ are not altered significantly by the
network.
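
The two ingredients of the proof of Theorem 6 can be checked numerically. The following is a minimal sketch, not part of the original analysis: it draws a random Gaussian layer with entry variance 1/m (the normalization used in the appendix), applies it to arbitrary random point pairs (an assumption; the theorem concerns points of K), and reports how much the ReLU and the linear map change the distance between a ball center and a nearby point. The function name and all parameter values are hypothetical.

```python
import numpy as np


def relu(v):
    """Rectified linear unit applied entry-wise."""
    return np.maximum(v, 0.0)


def layer_radius_growth(n=64, m=256, num_pairs=200, seed=0):
    """Return (activation_ratio, linear_ratio) for one random layer.

    activation_ratio: max of ||relu(Mx) - relu(Mx0)|| / ||Mx - Mx0||,
                      which cannot exceed 1 because the ReLU is 1-Lipschitz.
    linear_ratio:     max of ||Mx - Mx0|| / ||x - x0|| over the sampled pairs.
    """
    rng = np.random.default_rng(seed)
    # Gaussian weights with variance 1/m per entry (assumed normalization).
    M = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    x = rng.normal(size=(num_pairs, n))
    x0 = x + 0.1 * rng.normal(size=(num_pairs, n))  # nearby "ball centers"

    d_in = np.linalg.norm(x - x0, axis=1)
    d_lin = np.linalg.norm((x - x0) @ M.T, axis=1)
    d_act = np.linalg.norm(relu(x @ M.T) - relu(x0 @ M.T), axis=1)
    return float(np.max(d_act / d_lin)), float(np.max(d_lin / d_in))


if __name__ == "__main__":
    act_ratio, lin_ratio = layer_radius_growth()
    print("activation ratio:", act_ratio, "linear ratio:", lin_ratio)
```

The first ratio illustrates why the activation never enlarges a covering ball; the second is the per-layer growth that the theorem controls.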

Note, however, that Theorem 6 does not apply to Theorem 1.
In order to do that for the latter, we also need a version
of Theorem 1 that guarantees a stable embedding using the
same metric at the input and the output of a given layer,
e.g., embedding from the Hamming cube to the Hamming
cube or from $\mathbb{S}^{n-1}$ to $\mathbb{S}^{m-1}$. Indeed, we have exactly such a
guarantee in Corollary 5, which implies stable embedding of the
Euclidean distances in each layer of the network. Though this
corollary focuses on the particular case of the ReLU, unlike
Theorem 1, which covers more general activation functions, it
implies stability for the whole network in the Lipschitz sense,
which is even stronger than stability in the Gromov-Hausdorff
sense that we would get by having the generalization of
Theorem 1.

As an implication of Theorem 6, consider low-dimensional
data admitting a Gaussian mixture model (GMM) with $L$
Gaussians of dimension $k$ or a $k$-sparse representation in a given
dictionary with $L$ atoms. For GMM, the covering number is
$N_\varepsilon(K) = L\left(1 + \frac{2}{\varepsilon}\right)^k$ for $\varepsilon < 1$ and 1 otherwise (see [31]).
Therefore we have that $\omega(K) \le C\sqrt{k + \log L}$ and that at each
layer the Gaussian mean width grows at most by a factor of
$1 + O\left(\frac{\sqrt{k + \log L}}{\sqrt{m}}\right)$. In the case of sparsely representable data,
$N_\varepsilon(K) = \binom{L}{k}\left(1 + \frac{2}{\varepsilon}\right)^k$. By Stirling's approximation we have
$\binom{L}{k} \le \left(\frac{eL}{k}\right)^k$ and therefore $\omega(K) \le C\sqrt{k\log(L/k)}$. Thus, at
each layer the Gaussian mean width grows at most by a factor
of $1 + O\left(\frac{\sqrt{k\log(L/k)}}{\sqrt{m}}\right)$.

V. Training Set Size 

An important question in deep learning is what is the 
amount of labeled samples needed at training. Using Sudakov 
minoration am, one can get an upper bound on the size of 
a covering of $K$, $N_\varepsilon(K)$, which is the number of balls of
radius $\varepsilon$ that include all the points in $K$. We have
demonstrated that networks with random Gaussian weights realize
a stable embedding; consequently, if a network is trained 
using the screening technique by selecting the best among 
many networks generated with random weights as suggested 
in a, a, a, then the number of data points needed in 
order to guarantee that the network represents all the data is
$O(\exp(\omega(K)^2/\varepsilon^2))$. Since $\omega(K)^2$ is a proxy for the intrinsic
data dimension as we have seen in the previous sections
(see ll2^ for more details), this bound formally predicts that 
the number of training points grows exponentially with the 
intrinsic dimension of the data. 
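
As a rough worked example of how quickly this bound grows, the sketch below evaluates the logarithm of exp(w(K)^2 / eps^2) for a few values of the squared mean width. The constant hidden in the O(.) notation is ignored (an assumption), so the numbers indicate only the order of growth, and the function name is hypothetical.

```python
import math


def log10_training_points(mean_width, eps):
    """Return log10 of exp(mean_width^2 / eps^2), ignoring the O(.) constant."""
    return (mean_width ** 2) / (eps ** 2) / math.log(10.0)


if __name__ == "__main__":
    eps = 1.0
    for dim in (5, 10, 20, 40):
        # Treat w(K)^2 as a stand-in for the intrinsic dimension.
        width = math.sqrt(dim)
        print(dim, "-> about 10^%.1f points" % log10_training_points(width, eps))
```

Doubling the intrinsic dimension doubles the exponent, which is the exponential dependency discussed next.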

The exponential dependency is too pessimistic, as it is often 
possible to achieve a better bound on the required training 
sample size. Indeed, the bound developed in ISl requires 
far fewer samples. As the authors study the data recovery
capability of an autoencoder, they assume that there exists a
'ground-truth autoencoder' generating the data. A combination
of the data dimension bound here provided, with a prior on 
the relation of the data to a deep network, should lead to a 
better prediction of the number of needed training samples. In 
fact, we cannot refrain from drawing an analogy with the field 
of sparse representations of signals, where the combined use 
of the properties of the system with those of the input data led 
to works that improved the bounds beyond the naive manifold 
covering number (see [50] and references therein).

The following section presents such a combination, by 
showing empirically that the purpose of training in DNN is 
to treat boundary points. This observation is likely to lead to 
a significant reduction in the required size of the training data, 
and may also apply to active learning, where the training set 
is constructed adaptively. 


VI. The Role of Training 


The proof of Theorem 3 provides us with an insight into the
role of training. One key property of the Gaussian distribution, 
which allows it to keep the ratio between the angles in the data, 
is its rotational invariance. The phase of a random Gaussian 
vector with i.i.d. entries is a random vector with a uniform 
distribution. Therefore, it does not prioritize one direction in 
the manifold over the other but treats all the same, leading to 
the behavior of the angles and distances throughout the net 
that we have described above. 

In general, points within the same class would have small
angles between them and points from distinct classes would
have larger ones. If this holds for all the points, then random 
Gaussian weights would be an ideal choice for the network 
parameters. However, as in practice this is rarely the case, an 
important role of the training would be to select in a smart way 
the separating hyper-planes induced by M in such a way that 
the angles between points from different classes are ‘penalized 
more’ than the angles between the points in the same class. 

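Theorem 3 and its proof provide some understanding of
how this can be done by the learning process. Consider the
expectation of the squared Euclidean distance between two points
$\mathbf{x}$ and $\mathbf{y}$ at the output of a given layer. It reads as (the
derivation appears in Appendix A)

$$E\left[\|\rho(\mathbf{M}\mathbf{x}) - \rho(\mathbf{M}\mathbf{y})\|_2^2\right] = \frac{1}{2}\|\mathbf{x}\|_2^2 + \frac{1}{2}\|\mathbf{y}\|_2^2 - \frac{2\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{\pi}\int_0^{\pi - \angle(\mathbf{x},\mathbf{y})} \sin(\theta)\sin(\theta + \angle(\mathbf{x},\mathbf{y}))\,d\theta. \tag{11}$$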


Note that the integration in this formula is done uniformly
over the interval $[0, \pi - \angle(\mathbf{x},\mathbf{y})]$, which contains the range
of directions that have simultaneously positive inner products
with $\mathbf{x}$ and $\mathbf{y}$. With learning, we have the ability to pick the
angle $\theta$ that maximizes/minimizes the inner product based on
whether $\mathbf{x}$ and $\mathbf{y}$ belong to the same class or to distinct classes
and in this way increase/decrease their Euclidean distances at
the output of the layer.
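
The expected output distance of a random ReLU layer can be compared against a simulation. The sketch below is illustrative only: it assumes weights with variance 1/m per entry (as in the appendix) and uses the closed form obtained by evaluating the integral ourselves, namely one half of the squared norms minus (||x|| ||y|| / pi)(sin a + (pi - a) cos a), with a the angle between x and y. Function names and the layer width are hypothetical.

```python
import numpy as np


def expected_sq_dist(x, y):
    """Closed-form expectation of ||relu(Mx) - relu(My)||^2 (assumed form)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    ang = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    cross = nx * ny / np.pi * (np.sin(ang) + (np.pi - ang) * np.cos(ang))
    return float(0.5 * nx ** 2 + 0.5 * ny ** 2 - cross)


def simulated_sq_dist(x, y, m=20000, seed=0):
    """Squared output distance for one random Gaussian layer with m rows."""
    rng = np.random.default_rng(seed)
    M = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, x.size))
    diff = np.maximum(M @ x, 0.0) - np.maximum(M @ y, 0.0)
    return float(diff @ diff)


if __name__ == "__main__":
    x = np.array([1.0, 0.0, 0.0])
    y = np.array([0.0, 1.0, 0.0])
    print("closed form:", expected_sq_dist(x, y))
    print("simulation :", simulated_sq_dist(x, y))
```

For a fixed pair of norms the closed form grows with the angle, which is the angle-dependent distortion that training can exploit.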

Optimizing over all the angles between all the pairs of
the points is a hard problem. This explains why random
initialization is a good choice for DNN. As it is hard to
find the optimal configuration that separates the classes on the
manifold, it is desirable to start with a universal one that treats
most of the angles and distances well, and then to correct it
in the locations that result in classification errors.

Fig. 6. Ratios of closest inter- (left) and farthest intra- (right) class
Euclidean (top) and angular (bottom) distances for CIFAR-10. For each data
point we calculate its Euclidean distance to the farthest point from its class
and to the closest point not in its class, both at the input of the DNN and at
the output of the last convolutional layer. Then we compute the ratio between
the two, i.e., if $\mathbf{x}$ is the point at the input, $\mathbf{y}$ is its farthest point in class,
$\bar{\mathbf{x}}$ is the point at the output, and $\bar{\mathbf{z}}$ is its farthest point from the same
class (it is not necessarily the output of $\mathbf{y}$), then we calculate
$\|\bar{\mathbf{x}} - \bar{\mathbf{z}}\|_2 / \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{z}}) / \angle(\mathbf{x}, \mathbf{y})$.
We do the same for the distances between different classes, comparing the
shortest Euclidean and angular distances.

Fig. 7. Differences of closest inter- (left) and farthest intra- (right) class
Euclidean (top) and angular (bottom) distances for CIFAR-10 (a counterpart
of Fig. 6 with distance ratios replaced with differences). For each data point
we calculate its Euclidean distance to the farthest point from its class and to
the closest point not in its class, both at the input of the DNN and at the output
of the last convolutional layer. Then we compute the difference between the
two, i.e., if $\mathbf{x}$ is the point at the input, $\mathbf{y}$ is its farthest point in class,
$\bar{\mathbf{x}}$ is the point at the output, and $\bar{\mathbf{z}}$ is its farthest point from the same
class (it is not necessarily the output of $\mathbf{y}$), then we calculate
$\|\bar{\mathbf{x}} - \bar{\mathbf{z}}\|_2 - \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{z}}) - \angle(\mathbf{x}, \mathbf{y})$.
We do the same for the distances between different classes, comparing the
shortest Euclidean and angular distances.

To validate this hypothesized behavior, we trained two DNN
on the MNIST and CIFAR-10 datasets, each containing 10
classes. The training of the networks was done using the
matconvnet toolbox [51]. The MNIST and CIFAR-10 networks
were trained with four and five layers, respectively, followed
by a softmax operation. We used the default settings provided
by the toolbox for each dataset, where with CIFAR-10 we
also used horizontal mirroring and 64 filters in the first two
layers instead of 32 (which is the default in the example
provided with the package) to improve performance. The
trained DNN achieve 1% and 18% errors for MNIST and
CIFAR-10, respectively.

For each data point we calculate its Euclidean and angular
distances to its farthest intra-class point and to its closest
inter-class point. We compute the ratio between the distances
at the output of the last convolutional layer (the input of
the fully connected layers) and the ones at the input. Let
$\mathbf{x}$ be the point at the input, $\mathbf{y}$ be its farthest point from
the same class, $\bar{\mathbf{x}}$ be the point at the output, and $\bar{\mathbf{z}}$ be its
farthest point from the same class (it is not necessarily
the output of $\mathbf{y}$); then we calculate $\|\bar{\mathbf{x}} - \bar{\mathbf{z}}\|_2 / \|\mathbf{x} - \mathbf{y}\|_2$ for Euclidean
distances and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{z}}) / \angle(\mathbf{x}, \mathbf{y})$ for the angles. We do the same
for the distances between different classes, comparing the
shortest ones. Fig. 6 presents histograms of these distance
ratios for CIFAR-10. In Fig. 7 we present the histograms
of the differences of the Euclidean and angular distances,
i.e., $\|\bar{\mathbf{x}} - \bar{\mathbf{z}}\|_2 - \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{z}}) - \angle(\mathbf{x}, \mathbf{y})$. We also
compare the behavior of all the inter- and intra-class distances
by computing the above ratios for all pairs of points $(\mathbf{x}, \mathbf{y})$ in
the input with respect to their corresponding points $(\bar{\mathbf{x}}, \bar{\mathbf{y}})$ at
the output. These ratios are presented in Fig. 8. We also present
the differences $\|\bar{\mathbf{x}} - \bar{\mathbf{y}}\|_2 - \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{y}}) - \angle(\mathbf{x}, \mathbf{y})$
in Fig. 9. We present the results for three trained networks, in
addition to the random one, denoted by Net1, Net2 and Net3.
Each of them corresponds to a different number of training
epochs, resulting in a different classification error.
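
The boundary-distance statistics described above can be computed along the following lines. This is a minimal sketch rather than the evaluation code behind the figures: it handles Euclidean distances only, assumes every class contains at least two distinct points and that at least two classes are present, and takes the input and output features as plain arrays with one row per data point. All names are hypothetical.

```python
import numpy as np


def pairwise_dist(a):
    """Matrix of Euclidean distances between the rows of a."""
    sq = np.sum(a ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * a @ a.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding


def boundary_ratios(x_in, x_out, labels):
    """Return (intra, inter) output/input distance ratios per data point.

    intra: farthest same-class distance at the output divided by the
           farthest same-class distance at the input.
    inter: closest other-class distance at the output divided by the
           closest other-class distance at the input.
    The farthest/closest points are searched independently at the input
    and at the output, as in the text.
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    d_in, d_out = pairwise_dist(x_in), pairwise_dist(x_out)

    def farthest_intra(d):
        return np.where(same, d, -np.inf).max(axis=1)

    def closest_inter(d):
        return np.where(same, np.inf, d).min(axis=1)

    intra = farthest_intra(d_out) / farthest_intra(d_in)
    inter = closest_inter(d_out) / closest_inter(d_in)
    return intra, inter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(20, 5))
    labs = np.arange(20) % 2
    intra, inter = boundary_ratios(feats, 2.0 * feats, labs)
    print(intra[:3], inter[:3])
```

Histograms of the two returned arrays correspond to the right and left panels of the ratio plots; replacing the division by a subtraction gives the difference plots.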

Considering the random DNN, note that all the histograms
of the ratios are centered around 1 and the ones of the
differences around 0, implying that the network preserves most
of the distances as our theorems predict for a network with
random weights. For the trained networks, the histograms over
all data point pairs (Figs. 8 and 9) change only slightly due
to training. Also observe that the trained networks behave
like their random counterparts in keeping the distance of a
randomly picked pair of points. However, they distort the
distances between points on class boundaries "better" than
the random network (Figs. 6 and 7), in the sense that the
farthest intra-class distances are shrunk with a larger factor
than the ones of the random network, and the closest
inter-class distances are set farther apart by the training. Notice that
the shrinking of the distances within the class and enlargement
of the distances between the classes improves as the training
proceeds. This confirms our hypothesis that a goal of training
is to treat the boundary points.

Fig. 8. Ratios of inter- (left) and intra- (right) class Euclidean (top) and
angular (bottom) distances between randomly selected points for CIFAR-10.
We calculate the Euclidean distances between randomly selected pairs of data
points from different classes (left) and from the same class (right), both at
the input of the DNN and at the output of the last convolutional layer. Then
we compute the ratio between the two, i.e., for all pairs of points $(\mathbf{x}, \mathbf{y})$ in
the input and their corresponding points $(\bar{\mathbf{x}}, \bar{\mathbf{y}})$ at the output we calculate
$\|\bar{\mathbf{x}} - \bar{\mathbf{y}}\|_2 / \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{y}}) / \angle(\mathbf{x}, \mathbf{y})$.

Fig. 9. Differences of inter- (left) and intra- (right) class Euclidean (top)
and angular (bottom) distances between randomly selected points for
CIFAR-10 (a counterpart of Fig. 8 with distance ratios replaced with differences).
We calculate the Euclidean distances between randomly selected pairs of data
points from different classes (left) and from the same class (right), both at
the input of the DNN and at the output of the last convolutional layer. Then
we compute the difference between the two, i.e., for all pairs of points $(\mathbf{x}, \mathbf{y})$
in the input and their corresponding points $(\bar{\mathbf{x}}, \bar{\mathbf{y}})$ at the output we calculate
$\|\bar{\mathbf{x}} - \bar{\mathbf{y}}\|_2 - \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{y}}) - \angle(\mathbf{x}, \mathbf{y})$.

A similar behavior can be observed for the angles. The closest
angles are enlarged more in the trained network compared
to the random one. However, enlarging the angles between
classes also causes the enlargement of the angles within the
classes. Notice though that these are enlarged less than the
ones which are outside the class. Finally, observe that the
enlargement of the angles, as we have seen in our theorems,
causes a larger distortion in the Euclidean distances. Therefore,
we may explain the enlargement of the angles within the
class as a means for shrinking the intra-class distances.

Similar behavior is observed for the MNIST dataset. However,
the gaps between the random network and the trained
network are smaller as the MNIST dataset contains data which 
are initially well separated. As we have argued above, for such 
manifolds the random network is already a good choice. 

We also compared the behavior of the validation data, of 
the ImageNet dataset, in the network provided by 13^ and in 
the same network but with random weights. The results are
presented in Figs. 10, 11, 12 and 13. Behavior similar to the
one we observed in the case of CIFAR-10 is also manifested
by the ImageNet network.


VII. Discussion and Conclusion 

We have shown that DNN with random Gaussian weights
perform a stable embedding of the data, drawing a connection
between the dimension of the features produced by the
network that still keep the metric information of the original
manifold, and the complexity of the data. The metric preservation
property of the network provides a formal relationship
between the complexity of the input data and the size of
the required training set. Interestingly, follow-up studies [52],
[53] found that adding metric preservation constraints to the
training of networks also leads to a theoretical relation between
the complexity of the data and the number of training samples.
Moreover, this constraint is shown to improve the generalization
error in practice, i.e., it improves the classification results
when only a small number of training examples is available.

While preserving the structure of the initial metric is
important, it is vital to have the ability to distort some of
the distances in order to deform the data in a way that the
Euclidean distances represent more faithfully the similarity
we would like to have between points from the same class.
We proved that such an ability is inherent to the DNN
architecture; the Euclidean distances of the input data are
distorted throughout the networks based on the angles between
the data points. Our results lead to the conclusion that DNN
are universal classifiers for data based on the angles of the
principal axis between the classes in the data. As these are
not the angles we would like to work with in reality, the
training of the DNN reveals the actual angles in the data. In
fact, for some applications it is possible to use networks with
random weights at the first layers for separating the points
with distinguishable angles, followed by trained weights at
the deeper layers for separating the remaining points. This is
practiced in the extreme learning machines (ELM) techniques
[54], and our results provide a possible theoretical explanation
for the success of this hybrid strategy.

Fig. 10. Ratios of closest inter- (left) and farthest intra- (right) class
Euclidean (top) and angular (bottom) distances for ImageNet. For each data
point we calculate its Euclidean distance to the farthest point from its class
and to the closest point not in its class, both at the input of the DNN and at
the output of the last convolutional layer. Then we compute the ratio between
the two, i.e., if $\mathbf{x}$ is the point at the input, $\mathbf{y}$ is its farthest point in class,
$\bar{\mathbf{x}}$ is the point at the output, and $\bar{\mathbf{z}}$ is its farthest point from the same
class (it is not necessarily the output of $\mathbf{y}$), then we calculate
$\|\bar{\mathbf{x}} - \bar{\mathbf{z}}\|_2 / \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{z}}) / \angle(\mathbf{x}, \mathbf{y})$.
We do the same for the distances between different classes, comparing the
shortest Euclidean and angular distances.

Fig. 11. Differences of closest inter- (left) and farthest intra- (right) class
Euclidean (top) and angular (bottom) distances for ImageNet (a counterpart of
Fig. 10 with distance ratios replaced with differences). For each data point we
calculate its Euclidean distance to the farthest point from its class and to the
closest point not in its class, both at the input of the DNN and at the output
of the last convolutional layer. Then we compute the difference between the
two, i.e., if $\mathbf{x}$ is the point at the input, $\mathbf{y}$ is its farthest point in class,
$\bar{\mathbf{x}}$ is the point at the output, and $\bar{\mathbf{z}}$ is its farthest point from the same
class (it is not necessarily the output of $\mathbf{y}$), then we calculate
$\|\bar{\mathbf{x}} - \bar{\mathbf{z}}\|_2 - \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{z}}) - \angle(\mathbf{x}, \mathbf{y})$.
We do the same for the distances between different classes, comparing the
shortest Euclidean and angular distances.

Our work implies that it is possible to view DNN as a
stagewise metric learning process, suggesting that it might be
possible to replace the currently used layers with other metric
learning algorithms, opening a new avenue for semi-supervised
DNN. This also stands in line with the recent literature on
convolutional kernel methods (see [55], [56]).

In addition, we observed that a potential main goal of the
training of the network is to treat the class boundary points,
while keeping the other distances approximately the same.
This may lead to a new active learning strategy for deep
learning [57].

Appendix A
Proof of Theorem 3

Before we turn to prove Theorem 3, we present two propositions
that will aid us in its proof. The first is the Gaussian
concentration bound that appears in [48, Equation 1.6].

Proposition 7: Let $\mathbf{g}$ be an i.i.d. Gaussian random vector
with zero mean and unit variance, and $\eta$ be a Lipschitz-continuous
function with a Lipschitz constant $c_\eta$. Then for
every $\alpha > 0$, with probability exceeding $1 - 2\exp(-\alpha^2/2c_\eta)$,

$$|\eta(\mathbf{g}) - E[\eta(\mathbf{g})]| < \alpha. \tag{12}$$
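
The concentration bound can be illustrated empirically. The sketch below is only a sanity check under stated assumptions: it uses the Euclidean norm as the Lipschitz function (its Lipschitz constant is 1, so the bound reads 2 exp(-alpha^2 / 2)), and it replaces the exact expectation by the sample mean. Function names and parameter values are hypothetical.

```python
import numpy as np


def deviation_frequency(n=50, alpha=1.5, trials=20000, seed=0):
    """Fraction of trials where | ||g|| - mean(||g||) | is at least alpha."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=(trials, n))
    eta = np.linalg.norm(g, axis=1)  # a 1-Lipschitz function of g
    # The sample mean stands in for the expectation (an approximation).
    return float(np.mean(np.abs(eta - eta.mean()) >= alpha))


def unit_lipschitz_bound(alpha):
    """Failure probability 2 exp(-alpha^2 / 2) for a Lipschitz constant of 1."""
    return float(2.0 * np.exp(-alpha ** 2 / 2.0))


if __name__ == "__main__":
    alpha = 1.5
    print("empirical:", deviation_frequency(alpha=alpha))
    print("bound    :", unit_lipschitz_bound(alpha))
```

The empirical frequency falls well below the bound, as expected from an inequality that holds for every Lipschitz function and every dimension.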

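Proposition 8: Let $\mathbf{m} \in \mathbb{R}^n$ be a random vector with zero-mean
i.i.d. Gaussian distributed entries with variance $\frac{1}{m}$, and
$K \subset \mathbb{B}^n$ be a set with a Gaussian mean width $\omega(K)$. Then,
with probability exceeding $1 - 2\exp(-\omega(K)^2/4)$,

$$\sup_{\mathbf{x},\mathbf{y} \in K} \left(\rho(\mathbf{m}^T\mathbf{x}) - \rho(\mathbf{m}^T\mathbf{y})\right)^2 < \frac{4\omega(K)^2}{m}. \tag{13}$$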

Proof: First, notice that from the properties of the ReLU $\rho$, it
holds that

$$|\rho(\mathbf{m}^T\mathbf{x}) - \rho(\mathbf{m}^T\mathbf{y})| \le |\mathbf{m}^T\mathbf{x} - \mathbf{m}^T\mathbf{y}| = |\mathbf{m}^T(\mathbf{x} - \mathbf{y})|. \tag{14}$$

Let $\mathbf{g} = \sqrt{m}\cdot\mathbf{m}$ be a scaled version of $\mathbf{m}$ such that each entry
in $\mathbf{g}$ has a unit variance (as $\mathbf{m}$ has a variance $\frac{1}{m}$). From the
Gaussian mean width characteristics (see Proposition 2.1 in
ISU), we have $E\sup_{\mathbf{x},\mathbf{y} \in K} |\mathbf{g}^T(\mathbf{x} - \mathbf{y})| = \omega(K)$. Therefore,
combining the Gaussian concentration bound in Proposition 7
together with the fact that $\sup_{\mathbf{x},\mathbf{y} \in K} |\mathbf{g}^T(\mathbf{x} - \mathbf{y})|$ is Lipschitz-continuous
with a constant $c_\eta = 2$ (since $K \subset \mathbb{B}^n$), we have

$$\left|\sup_{\mathbf{x},\mathbf{y} \in K} |\mathbf{g}^T(\mathbf{x} - \mathbf{y})| - \omega(K)\right| < \alpha, \tag{15}$$

with probability exceeding $1 - 2\exp(-\alpha^2/4)$. Clearly, (15)
implies

$$\sup_{\mathbf{x},\mathbf{y} \in K} |\mathbf{g}^T(\mathbf{x} - \mathbf{y})| < 2\omega(K), \tag{16}$$

where we set $\alpha = \omega(K)$. Combining (16) and (14) with the
fact that

$$\sup_{\mathbf{x},\mathbf{y} \in K} \left(\rho(\mathbf{g}^T\mathbf{x}) - \rho(\mathbf{g}^T\mathbf{y})\right)^2 = \left(\sup_{\mathbf{x},\mathbf{y} \in K} \left(\rho(\mathbf{g}^T\mathbf{x}) - \rho(\mathbf{g}^T\mathbf{y})\right)\right)^2$$

leads to

$$\sup_{\mathbf{x},\mathbf{y} \in K} \left(\rho(\mathbf{g}^T\mathbf{x}) - \rho(\mathbf{g}^T\mathbf{y})\right)^2 < 4\omega(K)^2. \tag{17}$$

Dividing both sides by $m$ completes the proof. □

Fig. 12. Ratios of inter- (left) and intra- (right) class Euclidean (top) and
angular (bottom) distances between randomly selected points for ImageNet.
We calculate the Euclidean distances between randomly selected pairs of data
points from different classes (left) and from the same class (right), both at
the input of the DNN and at the output of the last convolutional layer. Then
we compute the ratio between the two, i.e., for all pairs of points $(\mathbf{x}, \mathbf{y})$ in
the input and their corresponding points $(\bar{\mathbf{x}}, \bar{\mathbf{y}})$ at the output we calculate
$\|\bar{\mathbf{x}} - \bar{\mathbf{y}}\|_2 / \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{y}}) / \angle(\mathbf{x}, \mathbf{y})$.

Fig. 13. Differences of inter- (left) and intra- (right) class Euclidean (top) and
angular (bottom) distances between randomly selected points for ImageNet (a
counterpart of Fig. 12 with distance ratios replaced with differences). We
calculate the Euclidean distances between randomly selected pairs of data
points from different classes (left) and from the same class (right), both at
the input of the DNN and at the output of the last convolutional layer. Then
we compute the difference between the two, i.e., for all pairs of points $(\mathbf{x}, \mathbf{y})$
in the input and their corresponding points $(\bar{\mathbf{x}}, \bar{\mathbf{y}})$ at the output we calculate
$\|\bar{\mathbf{x}} - \bar{\mathbf{y}}\|_2 - \|\mathbf{x} - \mathbf{y}\|_2$ and $\angle(\bar{\mathbf{x}}, \bar{\mathbf{y}}) - \angle(\mathbf{x}, \mathbf{y})$.

Proof of Theorem 3: Our proof of Theorem 3 consists of
three key steps. In the first one, we show that the bound in
(4) holds with high probability for any two points $\mathbf{x}, \mathbf{y} \in K$.
In the second, we pick an $\varepsilon$-cover for $K$ and show that the
same holds for each pair in the cover. The last generalizes the
bound for any point in $K$.


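Bound for a pair $\mathbf{x}, \mathbf{y} \in K$: Denoting by $\mathbf{m}_i^T$ the $i$-th
row of $\mathbf{M}$, we rewrite

$$\|\rho(\mathbf{M}\mathbf{x}) - \rho(\mathbf{M}\mathbf{y})\|_2^2 = \sum_{i=1}^m \left(\rho(\mathbf{m}_i^T\mathbf{x}) - \rho(\mathbf{m}_i^T\mathbf{y})\right)^2. \tag{18}$$

Notice that since all the $\mathbf{m}_i$ have the same distribution, the
random variables $\left(\rho(\mathbf{m}_i^T\mathbf{x}) - \rho(\mathbf{m}_i^T\mathbf{y})\right)^2$ are also equally
distributed. Therefore, our strategy is to calculate the
expectation of these random variables and then to show, using
Bernstein's inequality, that the mean of these random variables
does not deviate much from their expectation.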

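We start by calculating their expectation

$$E\left(\rho(\mathbf{m}_i^T\mathbf{x}) - \rho(\mathbf{m}_i^T\mathbf{y})\right)^2 = E\left(\rho(\mathbf{m}_i^T\mathbf{x})\right)^2 + E\left(\rho(\mathbf{m}_i^T\mathbf{y})\right)^2 - 2E\rho(\mathbf{m}_i^T\mathbf{x})\rho(\mathbf{m}_i^T\mathbf{y}). \tag{19}$$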


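For calculating the first term at the right hand side (rhs) note
that $\mathbf{m}_i^T\mathbf{x}$ is a Gaussian random variable with variance $\|\mathbf{x}\|_2^2/m$.
Therefore, from the symmetry of the Gaussian distribution we
have that

$$E\left(\rho(\mathbf{m}_i^T\mathbf{x})\right)^2 = \frac{1}{2}E\left(\mathbf{m}_i^T\mathbf{x}\right)^2 = \frac{1}{2m}\|\mathbf{x}\|_2^2. \tag{20}$$

In the same way, $E\left(\rho(\mathbf{m}_i^T\mathbf{y})\right)^2 = \frac{1}{2m}\|\mathbf{y}\|_2^2$. For calculating
the third term at the rhs of (19), notice that $\rho(\mathbf{m}_i^T\mathbf{x})\rho(\mathbf{m}_i^T\mathbf{y})$
is non-zero only if both the inner product between $\mathbf{m}_i$ and $\mathbf{x}$ and
the inner product between $\mathbf{m}_i$ and $\mathbf{y}$ are positive. Therefore,
the smaller the angle between $\mathbf{x}$ and $\mathbf{y}$, the higher
the probability that both of them will have a positive inner
product with $\mathbf{m}_i$. Using the fact that the direction of a Gaussian
vector is uniformly distributed on the sphere, we can calculate the
expectation of $\rho(\mathbf{m}_i^T\mathbf{x})\rho(\mathbf{m}_i^T\mathbf{y})$ by the following integral,
which is dependent on the angle between $\mathbf{x}$ and $\mathbf{y}$:

$$E\left[\rho(\mathbf{m}_i^T\mathbf{x})\rho(\mathbf{m}_i^T\mathbf{y})\right] = \frac{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{\pi m}\int_0^{\pi - \angle(\mathbf{x},\mathbf{y})} \sin(\theta)\sin(\theta + \angle(\mathbf{x},\mathbf{y}))\,d\theta = \frac{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{2\pi m}\Big(\sin(\angle(\mathbf{x},\mathbf{y})) - \cos(\angle(\mathbf{x},\mathbf{y}))\big(\angle(\mathbf{x},\mathbf{y}) - \pi\big)\Big). \tag{21}$$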
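
The integral over the range of directions that have positive inner products with both points can be verified numerically. The sketch below compares a midpoint-rule quadrature with the closed form (1/2)(sin a - cos a (a - pi)), where a denotes the angle between the two points; the closed form is our own evaluation of the integral, and the function names and step count are hypothetical.

```python
import math


def overlap_integral(angle, steps=20000):
    """Midpoint-rule value of the integral of sin(t) sin(t + angle)
    for t between 0 and pi - angle."""
    upper = math.pi - angle
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.sin(t) * math.sin(t + angle)
    return total * h


def closed_form(angle):
    """(1/2) (sin(angle) - cos(angle) (angle - pi))."""
    return 0.5 * (math.sin(angle) - math.cos(angle) * (angle - math.pi))


if __name__ == "__main__":
    for a in (0.0, 0.7, math.pi / 2, 2.5):
        print(a, overlap_integral(a), closed_form(a))
```

The value decreases from pi/2 at angle 0 to 0 at angle pi, reflecting the shrinking set of directions shared by the two points.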


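Having the expectation of all the terms in (19) calculated, we
define the following random variable, which is the difference
between $\left(\rho(\mathbf{m}_i^T\mathbf{x}) - \rho(\mathbf{m}_i^T\mathbf{y})\right)^2$ and its expectation,

$$z_i = \left(\rho(\mathbf{m}_i^T\mathbf{x}) - \rho(\mathbf{m}_i^T\mathbf{y})\right)^2 - \frac{1}{2m}\|\mathbf{x} - \mathbf{y}\|_2^2 + \frac{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{\pi m}\Big(\sin(\angle(\mathbf{x},\mathbf{y})) - \cos(\angle(\mathbf{x},\mathbf{y}))\angle(\mathbf{x},\mathbf{y})\Big). \tag{22}$$

Clearly, the random variable $z_i$ is zero-mean. To finish the first
step of the proof, it remains to show that the sum $\sum_{i=1}^m z_i$ does
not deviate much from zero (its mean). First, note that

$$P\left(\left|\sum_{i=1}^m z_i\right| > t\right) \le P\left(\sum_{i=1}^m |z_i| > t\right), \tag{23}$$

and therefore it is enough to bound the term on the rhs of
(23). By Bernstein's inequality, we have

$$P\left(\sum_{i=1}^m |z_i| > t\right) \le \exp\left(-\frac{t^2/2}{\sum_{i=1}^m E z_i^2 + Mt/3}\right), \tag{24}$$

where $M$ is an upper bound on $|z_i|$. To calculate
$E z_i^2$, one needs to calculate the fourth moments
$E\left(\rho(\mathbf{m}_i^T\mathbf{x})\right)^4$ and $E\left(\rho(\mathbf{m}_i^T\mathbf{y})\right)^4$, which are easy to compute
by using the symmetry of Gaussian vectors, and the
correlations $E\rho(\mathbf{m}_i^T\mathbf{x})^3\rho(\mathbf{m}_i^T\mathbf{y})$, $E\rho(\mathbf{m}_i^T\mathbf{x})\rho(\mathbf{m}_i^T\mathbf{y})^3$ and
$E\rho(\mathbf{m}_i^T\mathbf{x})^2\rho(\mathbf{m}_i^T\mathbf{y})^2$. For calculating the latter, we use as
before the fact that the direction of a Gaussian vector is uniformly
distributed on the sphere and calculate an integral on an interval which
is dependent on the angle between $\mathbf{x}$ and $\mathbf{y}$. For example,

$$E\rho(\mathbf{m}_i^T\mathbf{x})^3\rho(\mathbf{m}_i^T\mathbf{y}) = \frac{4\|\mathbf{x}\|_2^3\|\mathbf{y}\|_2}{\pi m^2}\int_0^{\pi - \angle(\mathbf{x},\mathbf{y})} \sin^3(\theta)\sin(\theta + \angle(\mathbf{x},\mathbf{y}))\,d\theta, \tag{25}$$

where $\theta$ is the angle between $\mathbf{m}_i$ and $\mathbf{x}$. We have a similar
formula for the other terms. By simple arithmetic and using
the fact that $K \subset \mathbb{B}^n$, we have that $E z_i^2 \le 2.1/m^2$.

The type of formula in (25), which is similar to the one in
(21), provides an insight into the role of training. As random
layers 'integrate uniformly' on the interval $[0, \pi - \angle(\mathbf{x},\mathbf{y})]$,
learning picks the angle $\theta$ that maximizes/minimizes the inner
product based on whether $\mathbf{x}$ and $\mathbf{y}$ belong to the same class
or to distinct classes.

Using Proposition 8, the fact that $K \subset \mathbb{B}^n$ and the behavior
of $\psi(\mathbf{x}, \mathbf{y})$ (see Fig. 2) together with the triangle inequality
imply that $M \le \frac{4\omega(K)^2 + 3}{m}$ with probability exceeding
$1 - 2\exp(-\omega(K)^2/4)$. Plugging this bound with the one
we computed for $E z_i^2$ into (24) leads to

$$P\left(\left|\sum_{i=1}^m z_i\right| > \frac{\delta}{2}\right) \le \exp\left(-\frac{m\delta^2/8}{2.1 + 4\delta\omega(K)^2/3 + \delta}\right) + 2\exp\left(-\omega(K)^2/4\right), \tag{26}$$

where we included the probability of Proposition 8 in the
above bound. Since by the assumption of the theorem
$m \ge C\delta^{-4}\omega(K)^4$, we can write

$$P\left(\left|\sum_{i=1}^m z_i\right| > \frac{\delta}{2}\right) \le C\exp\left(-\frac{m\delta^2}{4\omega(K)^2}\right). \tag{27}$$

Bound for all $\mathbf{x}, \mathbf{y} \in N_\varepsilon(K)$: Let $N_\varepsilon(K)$ be an $\varepsilon$-cover
for $K$. By using a union bound we have that for every pair in
$N_\varepsilon(K)$,

$$P\left(\left|\sum_{i=1}^m z_i\right| > \frac{\delta}{2}\right) \le C\,|N_\varepsilon(K)|^2\exp\left(-\frac{m\delta^2}{4\omega(K)^2}\right). \tag{28}$$

By Sudakov's inequality we have $\log|N_\varepsilon(K)| \le c\varepsilon^{-2}\omega(K)^2$.
Plugging this inequality into (28) leads to

$$P\left(\left|\sum_{i=1}^m z_i\right| > \frac{\delta}{2}\right) \le C\exp\left(-\frac{m\delta^2}{4\omega(K)^2} + c\varepsilon^{-2}\omega(K)^2\right). \tag{29}$$

Setting $\varepsilon$ to be at least a fixed fraction of $\delta$, we have by the
assumption $m \ge C\delta^{-4}\omega(K)^4$ that the term in the exponent at the
rhs of (29) is negative and therefore the probability decays
exponentially as $m$ increases.

Bound for all $\mathbf{x}, \mathbf{y} \in K$: Let us rewrite $\mathbf{x}, \mathbf{y} \in K$ as $\mathbf{x}_0 + \mathbf{x}_\varepsilon$
and $\mathbf{y}_0 + \mathbf{y}_\varepsilon$, with $\mathbf{x}_0, \mathbf{y}_0 \in N_\varepsilon(K)$ and $\mathbf{x}_\varepsilon, \mathbf{y}_\varepsilon \in (K - K) \cap \varepsilon\mathbb{B}^n$,
where $K - K = \{\mathbf{x} - \mathbf{y} : \mathbf{x}, \mathbf{y} \in K\}$. We get to the desired
result by setting $\varepsilon$ to be at most a fixed fraction of $\delta$ and using
the triangle inequality combined with Proposition 8 to control
$\rho(\mathbf{M}\mathbf{x}_\varepsilon)$ and $\rho(\mathbf{M}\mathbf{y}_\varepsilon)$, the fact that $\omega(K - K) \le 2\omega(K)$ and the
Taylor expansions of the $\cos(\cdot)$, $\sin(\cdot)$ and $\cos^{-1}(\cdot)$ functions to
control the terms related to $\psi$ (see (5)). □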

Appendix B 
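Proof of Theorem 4

Proof: Instead of proving Theorem 4 directly, we deduce it
from Theorem 3. First we notice that (4) is equivalent to

$$\left|\frac{1}{2}\|\rho(\mathbf{M}\mathbf{x})\|_2^2 + \frac{1}{2}\|\rho(\mathbf{M}\mathbf{y})\|_2^2 - \frac{1}{4}\|\mathbf{x}\|_2^2 - \frac{1}{4}\|\mathbf{y}\|_2^2 - \left(\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y}) - \frac{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{2}\cos(\angle(\mathbf{x},\mathbf{y})) - \frac{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{2\pi}\Big(\sin(\angle(\mathbf{x},\mathbf{y})) - \cos(\angle(\mathbf{x},\mathbf{y}))\angle(\mathbf{x},\mathbf{y})\Big)\right)\right| \le \frac{\delta}{2}. \tag{30}$$

As $E\|\rho(\mathbf{M}\mathbf{x})\|_2^2 = \frac{1}{2}\|\mathbf{x}\|_2^2$, we also have that with high
probability (like the one in Theorem 3),

$$\left|\|\rho(\mathbf{M}\mathbf{x})\|_2^2 - \frac{1}{2}\|\mathbf{x}\|_2^2\right| \le \delta, \quad \forall \mathbf{x} \in K. \tag{31}$$

(The proof is very similar to the one of Theorem 3.) Applying
the reverse triangle inequality to (30) and then using (31),
followed by dividing both sides by $\frac{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{2}$, lead to

$$\left|\frac{2\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y})}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2} - \cos(\angle(\mathbf{x},\mathbf{y})) - \frac{1}{\pi}\Big(\sin(\angle(\mathbf{x},\mathbf{y})) - \cos(\angle(\mathbf{x},\mathbf{y}))\angle(\mathbf{x},\mathbf{y})\Big)\right| \le \frac{3\delta}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}. \tag{32}$$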

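Using the reverse triangle inequality with (32) leads to

$$\left|\frac{\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y})}{\|\rho(\mathbf{M}\mathbf{x})\|_2\|\rho(\mathbf{M}\mathbf{y})\|_2} - \cos(\angle(\mathbf{x},\mathbf{y})) - \frac{1}{\pi}\Big(\sin(\angle(\mathbf{x},\mathbf{y})) - \cos(\angle(\mathbf{x},\mathbf{y}))\angle(\mathbf{x},\mathbf{y})\Big)\right| \le \frac{3\delta}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2} + \left|\frac{2\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y})}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2} - \frac{\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y})}{\|\rho(\mathbf{M}\mathbf{x})\|_2\|\rho(\mathbf{M}\mathbf{y})\|_2}\right|. \tag{33}$$

To complete the proof it remains to bound the rhs of (33). For
the second term in it, we have

$$\left|\frac{2\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y})}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2} - \frac{\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y})}{\|\rho(\mathbf{M}\mathbf{x})\|_2\|\rho(\mathbf{M}\mathbf{y})\|_2}\right| = \left|\frac{2\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y})}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}\right| \cdot \left|1 - \frac{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{2\|\rho(\mathbf{M}\mathbf{x})\|_2\|\rho(\mathbf{M}\mathbf{y})\|_2}\right|. \tag{34}$$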


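Because $K \subset \mathbb{B}^n \setminus \beta\mathbb{B}^n$, it follows from (31) that

$$\|\rho(\mathbf{M}\mathbf{x})\|_2^2 \ge \frac{1}{2}\beta^2 - \delta. \tag{35}$$

Dividing both sides of (31) by $\left(\|\rho(\mathbf{M}\mathbf{x})\|_2 + \frac{1}{\sqrt{2}}\|\mathbf{x}\|_2\right)\|\rho(\mathbf{M}\mathbf{x})\|_2$
and then using (35) and the fact that $\|\mathbf{x}\|_2 \ge \beta$, provide

$$\left|1 - \frac{\|\mathbf{x}\|_2}{\sqrt{2}\|\rho(\mathbf{M}\mathbf{x})\|_2}\right| \le \frac{\delta}{\|\rho(\mathbf{M}\mathbf{x})\|_2^2 + \frac{1}{\sqrt{2}}\|\mathbf{x}\|_2\|\rho(\mathbf{M}\mathbf{x})\|_2} \le \frac{\delta}{\frac{1}{2}\beta^2 - \delta + \frac{\beta}{\sqrt{2}}\sqrt{\frac{1}{2}\beta^2 - \delta}} \le \frac{\delta}{\beta^2 - 2\delta}, \quad \forall \mathbf{x} \in K, \tag{36}$$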


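where the last inequality is due to simple arithmetic. Using
the triangle inequality and then the fact that the inequality in
(36) holds $\forall \mathbf{x} \in K$, we have

$$\left|1 - \frac{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}{2\|\rho(\mathbf{M}\mathbf{x})\|_2\|\rho(\mathbf{M}\mathbf{y})\|_2}\right| \le \left|1 - \frac{\|\mathbf{y}\|_2}{\sqrt{2}\|\rho(\mathbf{M}\mathbf{y})\|_2}\right| + \left|1 - \frac{\|\mathbf{x}\|_2}{\sqrt{2}\|\rho(\mathbf{M}\mathbf{x})\|_2}\right| \cdot \frac{\|\mathbf{y}\|_2}{\sqrt{2}\|\rho(\mathbf{M}\mathbf{y})\|_2} \le \frac{\delta}{\beta^2 - 2\delta} + \frac{\delta}{\beta^2 - 2\delta}\left(1 + \frac{\delta}{\beta^2 - 2\delta}\right) \le \frac{3\delta}{\beta^2 - 2\delta}. \tag{37}$$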

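Combining (32) with the facts that $\cos(\angle(\mathbf{x},\mathbf{y})) +
\frac{1}{\pi}\Big(\sin(\angle(\mathbf{x},\mathbf{y})) - \cos(\angle(\mathbf{x},\mathbf{y}))\angle(\mathbf{x},\mathbf{y})\Big)$ is bounded by 1 (see
Fig. 3) and $K \subset \mathbb{B}^n \setminus \beta\mathbb{B}^n$ lead to

$$\left|\frac{2\rho(\mathbf{M}\mathbf{x})^T\rho(\mathbf{M}\mathbf{y})}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}\right| \le 1 + \frac{3\delta}{\beta^2}. \tag{38}$$

Plugging (38) and (37) into (34) and then the outcome into
(33) lead to the bound $\frac{3\delta}{\beta^2} + \left(1 + \frac{3\delta}{\beta^2}\right)\frac{3\delta}{\beta^2 - 2\delta}$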